Extracting Features from Textual Data in Class Imbalance Problems

Abstract

We address class imbalance problems. These are classification problems where the target variable is binary and one class dominates over the other. A central objective in these problems is to identify features that yield models with high precision/recall values, the standard yardsticks for assessing such models. Our features are extracted from the textual data inherent in such problems. We use n-gram frequencies as features and introduce a discrepancy score that measures the efficacy of an n-gram in highlighting the minority class. The frequency counts of the n-grams with the highest discrepancy scores are used as features to construct models with the desired metrics. According to the best practices followed by the services industry, many customer support tickets will get audited and tagged as “contract-compliant”, whereas some will be tagged as “over-delivered”. Based on in-field data, we use a random forest classifier and perform a randomized grid search over the model hyperparameters, with model scoring performed using a scoring function. Our objective is to minimize the follow-up costs by optimizing the recall score while maintaining a base-level precision score. The final optimized model achieves an acceptable recall score while staying above the base-level precision. We validate our feature selection method by comparing our model with one constructed using frequency counts of n-grams chosen randomly. We propose extensions of our feature extraction method to general classification (binary and multi-class) and regression problems. The discrepancy score is one measure of the dissimilarity of distributions, and other (more general) measures that we formulate could potentially yield more effective models.
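The abstract does not give the formula for the discrepancy score, so the following is only a minimal sketch under a stated assumption: an n-gram is scored by the share of minority-class documents that contain it minus the share of majority-class documents that contain it. The function names (`ngrams`, `discrepancy_scores`, `top_ngram_features`) and the toy ticket texts are hypothetical and are not taken from the paper.

```python
from collections import Counter


def ngrams(text, n):
    """Return the word n-grams of a text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def discrepancy_scores(docs, labels, n=2, minority=1):
    """ASSUMED score: minority document share minus majority document share."""
    min_df, maj_df = Counter(), Counter()
    n_min = n_maj = 0
    for text, label in zip(docs, labels):
        grams = set(ngrams(text, n))  # count each n-gram once per document
        if label == minority:
            n_min += 1
            min_df.update(grams)
        else:
            n_maj += 1
            maj_df.update(grams)
    if n_min == 0 or n_maj == 0:
        raise ValueError("both classes must be present")
    vocab = set(min_df) | set(maj_df)
    return {g: min_df[g] / n_min - maj_df[g] / n_maj for g in vocab}


def top_ngram_features(docs, labels, k=1, n=2, minority=1):
    """Keep the k highest-scoring n-grams; use their counts as features."""
    scores = discrepancy_scores(docs, labels, n, minority)
    # Sort by descending score, breaking ties alphabetically for determinism.
    top = sorted(scores, key=lambda g: (-scores[g], g))[:k]
    features = []
    for text in docs:
        counts = Counter(ngrams(text, n))
        features.append([counts[g] for g in top])
    return top, features


# Made-up tickets: label 1 is the minority ("over-delivered") class.
docs = [
    "reset password done",
    "extra visit on site",
    "reset password done",
    "extra visit approved",
    "invoice sent",
]
labels = [0, 1, 0, 1, 0]

scores = discrepancy_scores(docs, labels, n=2)
top, X = top_ngram_features(docs, labels, k=1, n=2)
```

Here `X` holds one row of n-gram counts per ticket and could be passed to any classifier, such as the random forest mentioned in the abstract. Under this assumed score, n-grams that appear mostly in majority-class tickets receive negative values and are never selected.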

Related articles

Extracting Information from Citeseer’s Textual Data

This article deals with CiteSeer, a free online digital library and search engine of mainly computer science research papers. First, it discusses CiteSeer’s features and structure and then it presents what useful information on publications and author collaborations can be extracted from its textual data. We show the basic properties of both the publication citation and author citation graph. M...

Extracting Predictor Variables to Construct Breast Cancer Survivability Model with Class Imbalance Problem

Application of data mining methods as a decision support system has a great benefit to predict survival of new patients. It also has a great potential for health researchers to investigate the relationship between risk factors and cancer survival. But due to the imbalanced nature of datasets associated with breast cancer survival, the accuracy of survival prognosis models is a challenging issue...

Extracting Coactivated Features from Multiple Data Sets

We present a nonlinear generalization of Canonical Correlation Analysis (CCA) to find related structure in multiple data sets. The new method allows an arbitrary number of data sets to be analyzed, and the extracted features capture higher-order statistical dependencies. The features are independent components that are coupled across the data sets. The coupling takes the form of coactivation (depen...

Breast Cancer Diagnosis from Perspective of Class Imbalance

Introduction: Breast cancer is the second cause of mortality among women. Early detection is the only way to reduce the risk of breast cancer mortality. Traditional methods cannot effectively diagnose tumors since they are based on the assumption of a well-balanced dataset. However, a hybrid method can help to alleviate the two-class imbalance problem existing in the ...

Class Imbalance Problem in Data Mining Review

In the last few years, major changes and developments have taken place in the classification of data. As the application areas of technology grow, the size of data also increases. Classification of data becomes difficult because of the unbounded size and imbalanced nature of the data. The class imbalance problem has become the greatest issue in data mining. The imbalance problem occurs where one of the two classes havi...

Journal

Journal title: Journal of computer-assisted linguistic research

Year: 2022

ISSN: 2530-9455

DOI: https://doi.org/10.4995/jclr.2022.18200